
[BLIP] Fix daily CI failing test #20877

Merged
younesbelkada merged 9 commits into huggingface:main from younesbelkada:blip-fix-tolerance
Jan 5, 2023

Conversation

@younesbelkada
Contributor

What does this PR do?

This PR fixes: https://github.com/huggingface/transformers/actions/runs/3754402958/jobs/6378634199

Why is this fix relevant?

The reference logits for this test were obtained under pytorch==1.13.1+cu116, while the daily CI uses pytorch==1.13.0+cu116. Setting the tolerance slightly higher (4e-2) fixes the test and makes it compatible across versions.
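For illustration only (the drifted value below is made up, not an actual test value): with atol=4e-2, logits that differ by up to about 0.04 between torch builds still compare equal.

import torch

# Illustrative sketch of what a 4e-2 absolute tolerance absorbs; the drifted number is hypothetical.
reference = torch.tensor([[0.5053]])  # reference value recorded under pytorch==1.13.1+cu116
observed = torch.tensor([[0.5321]])   # hypothetical drifted value under pytorch==1.13.0+cu116
print(torch.allclose(observed, reference, atol=4e-2))  # True: |0.5321 - 0.5053| = 0.0268 < 0.04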

cc @LysandreJik @sgugger @ydshieh

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 22, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger left a comment

That is a very big tolerance. It would be better to identify the layer in the model causing this problem.
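One possible way to narrow this down (not part of this PR, and assuming the checkpoint used by the integration test is Salesforce/blip-itm-base-coco) would be to register forward hooks that dump every module's output, so the per-layer drift between torch versions can be compared:

import torch
from transformers import BlipForImageTextRetrieval

# Hypothetical debugging sketch: record each module's output so that dumps produced
# under two torch versions can be diffed layer by layer.
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco").eval()
activations = {}

def make_hook(name):
    def hook(module, args, output):
        if isinstance(output, torch.Tensor):
            activations[name] = output.detach().cpu()
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))

# Run the same inputs as the failing test, then torch.save(activations, ...) and
# compare the two files to find the first layer whose outputs diverge.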

@younesbelkada
Contributor Author

younesbelkada commented Dec 26, 2022

Hmm, at first I thought that the Softmax was causing the issue and leading to large rounding errors, but the test passes locally with torch==1.13.0+cu116 and does not pass on the Docker image that uses the same version. Will investigate more!


# Assertions under discussion, compared against reference values with atol/rtol of 1e-3
# (the last line uses torch.allclose's default tolerances).
self.assertTrue(torch.allclose(torch.nn.Softmax()(out_itm[0].cpu()), expected_scores, atol=1e-3, rtol=1e-3))
self.assertTrue(torch.allclose(out[0].cpu(), torch.Tensor([[0.5053]]), atol=1e-3, rtol=1e-3))
self.assertTrue(torch.allclose(out_itm[0][0][0].cpu(), expected_scores))
Collaborator

It would be great if we could figure out why the previous test logic failed across environments.
Let me know if I can help here, @younesbelkada :-)

@ydshieh
Collaborator

ydshieh commented Jan 2, 2023

On GCP (my own/ CI runners), all torch versions give

(torch 1.13.x)

[[0.97982633 0.02017363]]
[[0.50528485]]

or (torch 1.12.1)

[[0.97982633 0.02017365]]
[[0.5052849]]

so

[[0.9798, 0.0202]]
[[0.5053]]

will work. Not sure why you got a larger difference though; it is likely an env issue.
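As a quick sanity check (values copied from above; not part of the PR), both environments agree with the rounded references well within a 1e-3 tolerance:

import torch

# Values copied from the comment above; both torch versions match the rounded
# references with plenty of margin at atol=1e-3.
out_113 = torch.tensor([[0.97982633, 0.02017363]])  # torch 1.13.x
out_112 = torch.tensor([[0.97982633, 0.02017365]])  # torch 1.12.1
expected_scores = torch.tensor([[0.9798, 0.0202]])

print(torch.allclose(out_113, expected_scores, atol=1e-3))  # True
print(torch.allclose(out_112, expected_scores, atol=1e-3))  # True
print(torch.allclose(torch.tensor([[0.50528485]]), torch.tensor([[0.5053]]), atol=1e-3))  # True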

younesbelkada and others added 5 commits January 4, 2023 19:23
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
- add model.eval
- fix tolerance for GPU devices
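A minimal sketch of what those two commit items typically amount to in a slow integration test (the checkpoint, image URL, and prompt below are assumptions for illustration, not the actual diff):

import requests
import torch
from PIL import Image
from transformers import BlipForImageTextRetrieval, BlipProcessor

# Illustrative only: model.eval() removes dropout noise, and the resulting scores
# are then compared against reference values with a small tolerance in the test.
checkpoint = "Salesforce/blip-itm-base-coco"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForImageTextRetrieval.from_pretrained(checkpoint)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="a photo of two cats", return_tensors="pt")

with torch.no_grad():
    out_itm = model(**inputs)
scores = torch.nn.Softmax(dim=1)(out_itm[0])
print(scores)  # compared against expected_scores with e.g. atol=1e-3 in the test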
@younesbelkada
Contributor Author

younesbelkada commented Jan 4, 2023

Thanks so much @ydshieh 💯, the tests now pass on the CI Docker image with your suggested values!
Seems that something was wrong with my env indeed.

@younesbelkada younesbelkada requested a review from ydshieh January 4, 2023 19:47
@younesbelkada younesbelkada requested a review from sgugger January 5, 2023 08:15
Collaborator

@ydshieh left a comment

Nice 💯 and thank you!

@younesbelkada younesbelkada merged commit bf82c9b into huggingface:main Jan 5, 2023
silverriver pushed a commit to silverriver/transformers that referenced this pull request Jan 6, 2023
